library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.5
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.0.5
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.5 v stringr 1.4.0
## v tidyr 1.1.4 v forcats 0.5.1
## v readr 2.0.2
## Warning: package 'ggplot2' was built under R version 4.0.5
## Warning: package 'tibble' was built under R version 4.0.5
## Warning: package 'tidyr' was built under R version 4.0.5
## Warning: package 'readr' was built under R version 4.0.5
## Warning: package 'forcats' was built under R version 4.0.5
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidyr)
library(ggplot2)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(RColorBrewer)
library(plotly)
## Warning: package 'plotly' was built under R version 4.0.5
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
Aviation saftey is a topic that hits close to home, as I recently lost a co-worker in an airplane crash. For this project, my goal is to explore airplane crashes worldwide to find out how often they are occurring, what manufacturer has the most, and how many fatalities are there as a result. For my project, I am using the Airplane Crash Data Since 1908 dataset, which is a collection of the number of crashes from 1908 till 2019.
#My data
crashes <- read.csv("../data/airplanedata.csv")
head(crashes)
## Date Time Location Operator
## 1 9/17/1908 17:18 Fort Myer, Virginia Military - U.S. Army
## 2 9/7/1909 Juvisy-sur-Orge, France
## 3 7/12/1912 6:30 Atlantic City, New Jersey Military - U.S. Navy
## 4 8/6/1913 Victoria, British Columbia, Canada Private
## 5 9/9/1913 18:30 Over the North Sea Military - German Navy
## 6 10/17/1913 10:30 Near Johannisthal, Germany Military - German Navy
## Flight.. Route AC.Type Registration cn.ln Aboard
## 1 Demonstration Wright Flyer III 1 2
## 2 Air show Wright Byplane SC1 1
## 3 Test flight Dirigible 5
## 4 Curtiss seaplane 1
## 5 Zeppelin L-1 (airship) 20
## 6 Zeppelin L-2 (airship) 30
## Aboard.Passangers Aboard.Crew Fatalities Fatalities.Passangers
## 1 1 1 1 1
## 2 0 1 1 0
## 3 0 5 5 0
## 4 0 1 1 0
## 5 NULL NULL 14 NULL
## 6 NULL NULL 30 NULL
## Fatalities.Crew Ground
## 1 0 0
## 2 0 0
## 3 5 0
## 4 1 0
## 5 NULL 0
## 6 NULL 0
## Summary
## 1 During a demonstration flight, a U.S. Army flyer flown by Orville Wright nose-dived into the ground from a height of approximately 75 feet, killing Lt. Thomas E. Selfridge, 26, who was a passenger. This was the first recorded airplane fatality in history. One of two propellers separated in flight, tearing loose the wires bracing the rudder and causing the loss of control of the aircraft. Orville Wright suffered broken ribs, pelvis and a leg. Selfridge suffered a crushed skull and died a short time later.
## 2 Eugene Lefebvre was the first pilot to ever be killed in an air accident, after his controls jambed while flying in an air show.
## 3 First U.S. dirigible Akron exploded just offshore at an altitude of 1,000 ft. during a test flight.
## 4 The first fatal airplane accident in Canada occurred when American barnstormer, John M. Bryant, California aviator was killed.
## 5 The airship flew into a thunderstorm and encountered a severe downdraft crashing 20 miles north of Helgoland Island into the sea. The ship broke in two and the control car immediately sank drowning its occupants.
## 6 Hydrogen gas which was being vented was sucked into the forward engine and ignited causing the airship to explode and burn at 3,000 ft..
#Separating the Date column into 3 different ones to perform a better analysis
df <- separate(data = crashes,col = Date, into = c("Month", "Day", "Year"), sep = "\\/")
#Selecting the columns that I need to perform my analysis
date_df <- select(df, c(Month, Year, Fatalities, Aboard, AC.Type))
#Changing the data type of fatalities to a numeric
date_df$Fatalities <- as.numeric(date_df$Fatalities)
## Warning: NAs introduced by coercion
#Changing all the NAs to 0s
date_df[is.na(date_df)] <- 0
#Changing the data type of year to a numeric
date_df$Year <- as.numeric(date_df$Year)
#Creating a new column in our dataset called decade
date_df <- date_df %>%
mutate(Decade = floor(Year/10) * 10)
#Summing all the fatalities per Year
year_df <- date_df %>%
group_by(Year) %>%
summarise(Fatalities = sum(Fatalities))
#Summing all the fatalities per Decade
decade_df <- date_df %>%
group_by(Decade) %>%
summarise(Fatalities = sum(Fatalities))
#Creating a new dataset, that has no NAs
df_country <- crashes
df_country[is.na(df_country)] <- ""
#grouping them by Location, and tallying them up, and selecting the locations that had more than 10 crashes
df_country1 <- df_country %>%
group_by(Location) %>%
tally() %>%
filter(n >= 10)
#splitting the location couling into city, region, and country
df_country1$Loc <- df_country1$Location
df_country1 <- separate(data = df_country1, col = Location, into = c("City", "Region", "Country"), sep = "\\,")
df_country1[is.na(df_country1)] <- ""
#Sub-dataset
small_DF <- select(crashes, c(AC.Type, Aboard))
#Changing Aboard to a numeric
small_DF$Aboard <- as.numeric(small_DF$Aboard)
## Warning: NAs introduced by coercion
small_DF <- small_DF %>%
#Getting the first word of the manufacturer to use for later analysis
separate(col = AC.Type, into = c("Manufacturer"), sep = "\\ ") %>%
#grouping by the manufacturer
group_by(Manufacturer) %>%
#adding up the total aboard per manufacturer
summarise(x = Aboard) %>%
arrange(desc(x)) %>%
#filtering out flights that had less than 250 because I felt that would skew the results
filter(x > 250)
## Warning: Expected 1 pieces. Additional pieces discarded in 4896 rows [1, 2, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20, 21, 22, ...].
## `summarise()` has grouped output by 'Manufacturer'. You can override using the `.groups` argument.
small_DF <- small_DF %>%
group_by(Manufacturer) %>%
summarise(x = sum(x)) %>%
arrange(desc(x)) %>%
filter(x > 250)
small_DF
## # A tibble: 5 x 2
## Manufacturer x
## <chr> <dbl>
## 1 Boeing 6590
## 2 McDonnell 3853
## 3 Airbus 1990
## 4 Lockheed 1214
## 5 Ilyushin 532
#Changing the data type of Aboard to a numeriv
date_df$Aboard <- as.numeric(date_df$Aboard)
## Warning: NAs introduced by coercion
df_fatalityRate <- date_df %>%
separate(col = AC.Type, into = c("Manufacturer"), sep = "\\ ") %>%
#filtering for manufacturers that are in the dataset I created in the last chunk
filter(Manufacturer %in% small_DF$Manufacturer) %>%
group_by(Manufacturer, Year) %>%
#summing the totals across all flights per manufacturer
#I did it this way as I believe that this is more accurate then taking the average.
summarise(total_Fatalities = sum(Fatalities), total_Aboard = sum(Aboard)) %>%
#calculating the fatality rate
mutate(Fatality_Rate = (total_Fatalities/total_Aboard) * 100)
## Warning: Expected 1 pieces. Additional pieces discarded in 4896 rows [1, 2, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20, 21, 22, ...].
## `summarise()` has grouped output by 'Manufacturer'. You can override using the `.groups` argument.
df_fatalityRate
## # A tibble: 294 x 5
## # Groups: Manufacturer [5]
## Manufacturer Year total_Fatalities total_Aboard Fatality_Rate
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Airbus 1976 7 258 2.71
## 2 Airbus 1987 5 5 100
## 3 Airbus 1988 293 426 68.8
## 4 Airbus 1990 92 146 63.0
## 5 Airbus 1992 367 376 97.6
## 6 Airbus 1993 2 70 2.86
## 7 Airbus 1994 349 752 46.4
## 8 Airbus 1995 60 60 100
## 9 Airbus 1997 234 234 100
## 10 Airbus 1998 298 469 63.5
## # ... with 284 more rows
#creating the plot
ggplot() +
#adding the line that represents fatalities per year
geom_line(year_df, mapping = aes(x = Year, y = Fatalities), color = "Blue")+
#adding the line that represents fatalities per decade
geom_line(decade_df, mapping = aes(x = Decade, y = Fatalities/10), color = "Red")+
#adding title, subtitle, and caption
labs(title = "Fatalities by Year and Decade",
subtitle = "Worldwide Airplane Fatalities from 1908 till 2019",
caption = "Data Source: Kaggle.com") +
geom_smooth()+
#making the dual axis, and adding commas to the y-axis scales
scale_y_continuous(
name = "Annual Fatalities",
labels = scales::comma,
sec.axis = sec_axis(~.*10, name = "Decade Fatalities", labels = scales::comma))+
theme_minimal() +
#making the coloring of the y-axis match the color of their respective lines
theme(axis.text.y.left = element_text(color = "blue")) +
theme(axis.text.y.right = element_text(color = "red")) +
#positioning the title left aligned, not aligned with the plot
theme(plot.title.position = "plot")
#For this graph, I tried adding the colors to the title of the graph to match the lines, but i could not figure it out.
In this graph, you can see that based on the years, fatalities peaked just before 1975. It then had a slight decrease. Based off of the decades, it is showing the peak fatalities around the same time, but it shows that overall the Fatalities were steadly decreasing.
df_country1 %>%
ggplot(aes(x = reorder(Loc, n), y = n, fill = Region)) +
geom_col() +
scale_fill_brewer(palette = "Blues")+
coord_flip() +
labs(title = "Regions with the Most Crashes",
y = "Number of Crashes",
x = "") +
geom_text(aes(label = n), hjust = -0.5)+
theme_minimal() +
guides(fill = FALSE) +
theme(plot.title.position = "plot")
> This graph was shows the regions with the most common plane crashes. I think it is intresting that Russia has a decent size gap over New York. I say decent because as I was looking at this graph I noticed that there was a location called “near Moscow”, so adding that to the other bar of Russia, it has a total of 28 plane crashes.
#The commented out lines didn't work. I believe it has to do with me using plotly. But if I wasnt, I would have left aligned my title, and added a caption.
f <- df_fatalityRate %>%
ggplot(data = df_fatalityRate, mapping = aes(x = Manufacturer, y = Year, fill = Fatality_Rate))+
labs(title = "Fatality Rate by Manufacturer") +
#caption = "Data Source: kaggle.com") +
geom_tile() +
guides(fill = FALSE)
#theme(plot.title.position = "plot")
ggplotly(f)
This graph is showing the top manufacturers, and the fatality rates per year. The darker the color, the lower the fatality rate. I mention this because I want to compare Boeing and Lockheed Martin. They are two major companies and as you can see Lockheeds fatality rate is higher. (I guess that is a good thing, as boeing supplies the commericial aircraft.)
I really enjoyed exploring this dataset, and I wish I had more time. Some ideas that I wish to add:
operator of the flight could have effected the outcome of the flight. I noticed there was a lot of military operators, both foreign and domestic.summary of the crash to see how many were truly accidents vs. planned attacks.